In compiler construction, name mangling (also called name decoration) is a technique used to solve various problems caused by the need to resolve unique names for programming entities in many modern programming languages.
It provides a means to encode added information in the name of a function, structure, class or another data type, to pass more semantic information from the compiler to the linker.
The need for name mangling arises where a language allows different entities to be named with the same identifier as long as they occupy a different namespace (typically defined by a module, class, or explicit namespace directive) or have different type signatures (such as in function overloading). It is required in these uses because each signature might require a different, specialized calling convention in the machine code.
Any object code produced by compilers is usually linked with other pieces of object code (produced by the same or another compiler) by a type of program called a linker. The linker needs a great deal of information on each program entity. For example, to correctly link a function it needs its name, the number of arguments and their types, and so on.
The simple programming languages of the 1970s, like C, only distinguished subroutines by their name, ignoring other information including parameter and return types. Later languages, like C++, defined stricter requirements for routines to be considered "equal", such as the parameter types, return type, and calling convention of a function. These requirements enable method overloading and detection of some bugs (such as using different definitions of a function when compiling different source code files). These stricter requirements needed to work with extant programming tools and conventions. Thus, added requirements were encoded in the name of the symbol, since that was the only information a traditional linker had about a symbol.
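As a rough illustration of why encoding the signature in the symbol name helps, the following Python sketch models a traditional linker's symbol table as a dictionary keyed by name only. The `toy_mangle` function and its `name__type` output format are hypothetical and do not correspond to any real compiler's scheme.

```python
def toy_mangle(name, param_types):
    """Append parameter types to the name (hypothetical scheme, not a real ABI)."""
    suffix = "_".join(param_types) if param_types else "void"
    return f"{name}__{suffix}"


def link(symbols):
    """Model a traditional linker: a flat table keyed by symbol name only."""
    table = {}
    for symbol, address in symbols:
        if symbol in table:
            raise ValueError(f"duplicate symbol: {symbol}")
        table[symbol] = address
    return table


# Two overloads of f collide if only the bare name is used...
overloads = [("f", []), ("f", ["int"])]
try:
    link([(name, i) for i, (name, _) in enumerate(overloads)])
    collided = False
except ValueError:
    collided = True

# ...but get distinct entries once the parameter types are part of the name.
table = link([(toy_mangle(n, p), i) for i, (n, p) in enumerate(overloads)])
print(collided, sorted(table))
```

The linker model itself is unchanged between the two runs; only the names it sees differ, which is the reason mangling could be layered on top of existing tools.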
The mangling scheme for Windows was established by Microsoft and has been informally followed by other compilers including Digital Mars, Borland, and GNU Compiler Collection (GCC) when compiling code for the Windows platforms. The scheme even applies to other languages, such as Pascal, D, Delphi, Fortran, and C#. This allows subroutines written in those languages to call, or be called by, extant Windows libraries using a calling convention different from their default.
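When compiling the following C examples:
int _cdecl f(int x) { return 0; }
int _stdcall g(int y) { return 0; }
int _fastcall h(int z) { return 0; }
32-bit compilers emit, respectively:
_f
_g@4
@h@4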
In the stdcall and fastcall mangling schemes, the function is encoded as _name@X and @name@X respectively, where X is the number of bytes, in decimal, of the argument(s) in the parameter list (including those passed in registers, for fastcall). In the case of cdecl, the function name is merely prefixed by an underscore.
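The decoration rules above can be sketched in a few lines of Python. The `decorate` helper and its parameter names are hypothetical; it only models the three 32-bit conventions described here and assumes the caller already knows the total argument size in bytes.

```python
def decorate(name, convention, arg_bytes=0):
    """Return the 32-bit Windows decorated name for a C function.

    arg_bytes is the total size in bytes of the parameter list.
    """
    if convention == "cdecl":
        return f"_{name}"
    if convention == "stdcall":
        return f"_{name}@{arg_bytes}"
    if convention == "fastcall":
        return f"@{name}@{arg_bytes}"
    raise ValueError(f"unknown calling convention: {convention}")


# Three functions that each take a single 4-byte argument.
for fn, conv in [("f", "cdecl"), ("g", "stdcall"), ("h", "fastcall")]:
    print(decorate(fn, conv, arg_bytes=4))
```

For functions that each take one 4-byte argument, this yields `_f`, `_g@4` and `@h@4`.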
The 64-bit convention on Windows (Microsoft C) has no leading underscore. This difference may in some rare cases lead to unresolved externals when porting such code to 64 bits. For example, Fortran code can use 'alias' to link against a C method by name as follows:
This will compile and link fine under 32 bits, but generate an unresolved external _f under 64 bits. One workaround for this is not to use 'alias' at all (in which case the method names typically need to be capitalized in C and Fortran). Another is to use the BIND option:
In C, most compilers also mangle static functions and variables (and, in C++, functions and variables declared static or placed in an anonymous namespace) using the same mangling rules as for their non-static versions. If functions with the same name (and, for C++, the same parameters) are defined and used in different translation units, they mangle to the same name, potentially leading to a clash. However, such functions are not equivalent: each is only ever called within its own translation unit. Compilers are usually free to emit arbitrary mangling for these functions, because it is illegal to access them from other translation units directly, so they never need linking between different object files. To prevent linking conflicts, compilers use the standard mangling but emit so-called 'local' symbols. When many such translation units are linked there may be multiple definitions of a function with the same name, but the resulting code calls only the one from its own translation unit. This is usually done using the relocation mechanism.
The C++ language does not define a standard decoration scheme, so each compiler uses its own. C++ also has complex language features, such as classes, templates, namespaces, and operator overloading, that alter the meaning of specific symbols based on context or usage. Meta-data about these features can be disambiguated by mangling (decorating) the name of a symbol. Because the name-mangling systems for such features are not standardized across compilers, few linkers can link object code that was produced by different compilers.
These are distinct functions, with no relation to each other apart from the name. The C++ compiler will therefore encode the type information in the symbol name, the result being something resembling:
Even a function whose name is unique is still mangled: name mangling applies to all C++ symbols (except for those in an extern "C" block).
namespace wikipedia
{
    class article
    {
    public:
        std::string format ();  // = _ZN9wikipedia7article6formatEv
        bool print_to (std::ostream&);  // = _ZN9wikipedia7article8print_toERSo
        class wikilink
        {
        public:
            wikilink (std::string const& name);  // = _ZN9wikipedia7article8wikilinkC1ERKSs
        };
    };
}
All mangled symbols begin with _Z (note that an identifier beginning with an underscore followed by a capital letter is a reserved identifier in C, so conflict with user identifiers is avoided); for nested names (including both namespaces and classes), this is followed by N, then a series of <length, id> pairs (the length being the length of the next identifier), and finally E. For example, wikipedia::article::format becomes:
_ZN9wikipedia7article6formatE
For functions, this is then followed by the type information; as format() takes no parameters, this is simply v; hence:
_ZN9wikipedia7article6formatEv
For print_to, the standard type std::ostream (which is a typedef for std::basic_ostream<char, std::char_traits<char> >) is used, which has the special alias So; a reference to this type is therefore RSo, with the complete name for the function being:
_ZN9wikipedia7article8print_toERSo
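The nested-name encoding (a `_Z` prefix, then `N`, length-prefixed identifiers, and a closing `E`, followed by the parameter encoding) can be sketched as below. The `mangle_nested` helper is hypothetical and covers only plain nested identifiers; it does not handle templates, substitutions, constructors or any other part of the scheme, and it expects the parameter list to be supplied already encoded.

```python
def mangle_nested(parts, params=""):
    """Encode a nested name such as ['wikipedia', 'article', 'format'].

    params is the already-encoded parameter list, e.g. 'v' for no
    parameters or 'RSo' for a reference to std::ostream.
    """
    # Each identifier is written as its length followed by its text.
    encoded = "".join(f"{len(p)}{p}" for p in parts)
    return f"_ZN{encoded}E{params}"


print(mangle_nested(["wikipedia", "article", "format"], "v"))
print(mangle_nested(["wikipedia", "article", "print_to"], "RSo"))
```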
extern "C" {
/* ... */
}
This idiom ensures that the symbols within the block are "unmangled": the compiler emits a binary file with their names undecorated, as a C compiler would do. As C language definitions are unmangled, the C++ compiler needs to avoid mangling references to these identifiers.
For example, the standard strings library, <string.h>, usually contains something resembling:
#ifdef __cplusplus
extern "C" {
#endif
void *memset (void *, int, size_t);
char *strcat (char *, const char *);
int strcmp (const char *, const char *);
char *strcpy (char *, const char *);
#ifdef __cplusplus
}
#endif
Thus, code such as:
strcpy(a, argv[2]);
else
memset (a, 0, sizeof(a));
uses the correct, unmangled strcpy and memset. If the extern "C" had not been used, the (SunPro) C++ compiler would produce code equivalent to:
__1cGstrcpy6Fpcpkc_0_(a, argv[2]);
else
__1cGmemset6FpviI_0_ (a, 0, sizeof(a));
Since those symbols do not exist in the C runtime library (e.g. libc), link errors would result.
The C++ standard therefore does not attempt to standardize name mangling. On the contrary, the Annotated C++ Reference Manual (also known as ARM, section 7.2.1c) actively encourages the use of different mangling schemes to prevent linking when other aspects of the ABI are incompatible.
Nevertheless, as detailed in the section above, on some platforms the full C++ ABI has been standardized, including name mangling.
It is good for safety purposes that compilers producing incompatible object codes (codes based on different ABIs, regarding e.g., classes and exceptions) use different name mangling schemes. This guarantees that these incompatibilities are detected at the linking phase, not when executing the software (which could lead to obscure bugs and serious stability issues).
For this reason, name decoration is an important aspect of any C++-related ABI.
There are instances, particularly in large, complex code bases, where it can be difficult or impractical to map the mangled name emitted within a linker error message back to the particular corresponding token/variable-name in the source. This problem can make identifying the relevant source file(s) very difficult for build or test engineers even if only one compiler and linker are in use. Demanglers (including those within the linker error reporting mechanisms) sometimes help but the mangling mechanism itself may discard critical disambiguating information.
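To give a feel for what a demangler does, here is a deliberately minimal Python sketch that decodes only plain nested names of the form `_ZN<length><id>...E`. The `demangle_nested` function is hypothetical; real demanglers handle far more (templates, substitutions, qualifiers, parameter types), and this one rejects anything it does not recognize.

```python
def demangle_nested(symbol):
    """Decode a plain nested name of the form _ZN<len><id>...E.

    Anything after the closing E (the parameter encoding) is ignored.
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("not a plain nested name")
    pos, parts = 3, []
    while symbol[pos] != "E":
        start = pos
        # Read the decimal length prefix of the next identifier.
        while symbol[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError(f"unsupported construct at offset {pos}")
        length = int(symbol[start:pos])
        parts.append(symbol[pos:pos + length])
        pos += length
    return "::".join(parts)


print(demangle_nested("_ZN9wikipedia7article6formatEv"))
```

Note that the result carries no parameter or return type information, because this sketch stops at the closing `E`.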
#include <stdio.h>
#include <stdlib.h>
#include <cxxabi.h>

int main() {
    const char *mangled_name = "_ZNK3MapI10StringName3RefI8GDScriptE10ComparatorIS0_E16DefaultAllocatorE3hasERKS0_";
    int status = -1;
    // On success, __cxa_demangle returns a malloc-allocated string and sets status to 0.
    char *demangled_name = abi::__cxa_demangle(mangled_name, NULL, NULL, &status);
    printf("Demangled: %s\n", demangled_name);
    free(demangled_name);
    return 0;
}
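Output:
Demangled: Map<StringName, Ref<GDScript>, Comparator<StringName>, DefaultAllocator>::has(StringName const&) const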
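public class foo {
    class bar {
        public int x;
    }

    public void zark () {
        Object f = new Object () {
            public String toString() {
                return "hello";
            }
        };
    }
}
Compiling this program will produce three .class files:
foo.class, containing the outer class foo
foo$bar.class, containing the inner class foo.bar
foo$1.class, containing the anonymous inner class (local to the method foo.zark)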
All of these class names are valid (as $ symbols are permitted in the JVM specification) and these names are "safe" for the compiler to generate, as the Java language definition advises not to use $ symbols in normal Java class definitions.
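The naming pattern for the generated files can be sketched as follows. The `class_file_names` helper and the `Outer`/`Inner` names are hypothetical, and the sketch assumes a single level of nesting with anonymous classes numbered from 1 in order of appearance, as javac does.

```python
def class_file_names(outer, member_classes=(), anonymous_count=0):
    """List the .class files emitted for one top-level class.

    Member classes become Outer$Inner; anonymous classes are numbered
    Outer$1, Outer$2, ... in order of appearance.
    """
    names = [f"{outer}.class"]
    names += [f"{outer}${inner}.class" for inner in member_classes]
    names += [f"{outer}${n}.class" for n in range(1, anonymous_count + 1)]
    return names


# One member class and one anonymous class inside a hypothetical Outer.
print(class_file_names("Outer", ["Inner"], anonymous_count=1))
```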
Name resolution in Java is further complicated at runtime, as fully qualified names for classes are unique only inside a specific classloader instance. Classloaders are ordered hierarchically and each Thread in the JVM has a so-called context class loader, so in cases where two different classloader instances contain classes with the same name, the system first tries to load the class using the root (or system) classloader and then goes down the hierarchy to the context class loader.
On encountering name mangled attributes, Python transforms these names by prepending a single underscore and the name of the enclosing class, for example:
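A minimal sketch of this behaviour, using a hypothetical `Account` class. Only names that start with two underscores and do not end with two underscores are rewritten.

```python
class Account:
    def __init__(self):
        self.__balance = 100      # stored as _Account__balance
        self.__dunder__ = "kept"  # trailing double underscore: not mangled
        self._single = "kept"     # single leading underscore: not mangled


acct = Account()
print(sorted(vars(acct)))
# → ['_Account__balance', '__dunder__', '_single']

# The mangled name is still reachable from outside the class.
print(acct._Account__balance)
# → 100
```

The transformation is applied at compile time to identifiers written inside the class body, which is why the attribute is not reachable as `__balance` from outside the class.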
myFunc name 'myFunc',
myProc name 'myProc';
Because Fortran is case insensitive, the name of a subroutine or function must be converted to a standardized case and format by the compiler so that it will be linked in the same way regardless of case. Different compilers have implemented this in various ways, and no standardization has occurred. The IBM AIX and HP-UX Fortran compilers convert all identifiers to lower case, while the Cray and Unicos Fortran compilers converted identifiers to all upper case. The GNU g77 compiler converts identifiers to lower case plus an underscore, except that identifiers already containing an underscore have two underscores appended, following a convention established by f2c. Many other compilers, including Silicon Graphics's (SGI) IRIX compilers, GNU Fortran, and Intel's Fortran compiler (except on Microsoft Windows), convert all identifiers to lower case plus an underscore. On Microsoft Windows, the Intel Fortran compiler defaults to uppercase without an underscore.
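The conventions described above can be compared side by side with a small Python sketch. The `fortran_symbol` function, its convention labels, and the sample identifiers are hypothetical shorthand for the behaviours in the text; they are not options of any actual compiler.

```python
def fortran_symbol(identifier, convention):
    """Map a Fortran identifier to an external symbol name."""
    lower = identifier.lower()
    if convention == "lower":             # e.g. IBM AIX, HP-UX
        return lower
    if convention == "upper":             # e.g. Cray, Intel on Windows
        return identifier.upper()
    if convention == "lower_underscore":  # e.g. GNU Fortran, SGI IRIX
        return lower + "_"
    if convention == "f2c":               # g77: two underscores if one is already present
        return lower + ("__" if "_" in lower else "_")
    raise ValueError(f"unknown convention: {convention}")


for conv in ("lower", "upper", "lower_underscore", "f2c"):
    print(conv, fortran_symbol("FooBar", conv), fortran_symbol("Foo_Bar", conv))
```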
Identifiers in Fortran 90 modules must be further mangled, because the same procedure name may occur in different modules. Since the Fortran 2003 Standard requires that module procedure names not conflict with other external symbols, compilers tend to use the module name and the procedure name, with a distinct marker in between. For example:
module m
contains
   integer function five()
      five = 5
   end function five
end module m
In this module, the name of the function will be mangled as __m_MOD_five by GNU Fortran, while other compilers (e.g., Intel's ifort or Oracle's sun95) use their own markers. Since Fortran does not allow overloading the name of a procedure, but uses generic interface blocks and generic type-bound procedures instead, the mangled names do not need to incorporate clues about the arguments.
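As a sketch of the "module name, marker, procedure name" pattern, the following assumes GNU Fortran's convention of a double-underscore prefix and a `_MOD_` marker with lowercased names. The `gfortran_module_symbol` helper is hypothetical, and other compilers use different markers.

```python
def gfortran_module_symbol(module, procedure):
    """Symbol name for a module procedure, assuming GNU Fortran's convention."""
    return f"__{module.lower()}_MOD_{procedure.lower()}"


print(gfortran_module_symbol("m", "five"))
# → __m_MOD_five
```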
The Fortran 2003 BIND option overrides any name mangling done by the compiler, as shown above.
Rust has used many versions of symbol mangling schemes that can be selected at compile time with an option. The following manglers are defined:
Examples are provided in the Rust tests.
+ (return-type) name0:parameter0 name1:parameter1 ...
- (return-type) name0:parameter0 name1:parameter1 ...
Class methods are signified by +, instance methods use -. A typical class method declaration may then look like:
With instance methods looking like this:
Each of these method declarations has a specific internal representation. When compiled, each method is named according to the following scheme for class methods:
_c_Class_name0_name1_ ...
and this for instance methods:
_i_Class_name0_name1_ ...
The colons in the Objective-C syntax are translated to underscores. So a class method with the selector name0:name1:, if belonging to the class Class, would translate as _c_Class_name0_name1_, and an instance method with the same selector (belonging to the same class) would translate to _i_Class_name0_name1_.
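The scheme can be sketched in Python as below. The `objc_symbol` helper is hypothetical, as are the `Point` class and the `initWithX:andY:` and `value` selectors used to exercise it.

```python
def objc_symbol(class_name, selector, class_method=False):
    """Build a symbol following the _c_/_i_ scheme described in the text."""
    prefix = "_c_" if class_method else "_i_"
    # Each colon in the selector becomes an underscore.
    return f"{prefix}{class_name}_{selector.replace(':', '_')}"


print(objc_symbol("Point", "initWithX:andY:", class_method=True))
print(objc_symbol("Point", "value"))
```

With these inputs the sketch yields `_c_Point_initWithX_andY_` for the class method and `_i_Point_value` for the instance method.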
Each of the methods of a class is labeled in this way. However, looking up a method that a class may respond to would be tedious if all methods were represented in this fashion. Each of the methods is therefore assigned a unique symbol (such as an integer). Such a symbol is known as a selector. In Objective-C, one can manage selectors directly: they have a specific type in Objective-C, SEL.
During compiling, a table is built that maps the textual representation of each method to selectors (which are given the type SEL). Managing selectors is more efficient than manipulating the text representation of a method. Note that a selector only matches a method's name, not the class it belongs to: different classes can have different implementations of a method with the same name. Because of this, implementations of a method are given a specific identifier too; these are known as implementation pointers, and are also given a type, IMP.
Message sends are encoded by the compiler as calls to the objc_msgSend function, or one of its cousins, whose first argument is the receiver of the message and whose second argument, a SEL, determines the method to call. Each class has its own table that maps selectors to their implementations: the implementation pointer specifies where in memory the implementation of the method resides. There are separate tables for class and instance methods. Apart from being stored in the SEL to IMP lookup tables, the functions are essentially anonymous.
The value for a selector does not vary between classes. This enables polymorphism.
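The relationship between selectors, per-class tables and implementations can be modelled loosely in Python. This is a conceptual sketch only: the `Selector`, `ObjCClass` and `msg_send` names are hypothetical, the runtime's real data structures differ, and inheritance and the class/instance method split are not modelled.

```python
class Selector:
    """Interned method name: one object per distinct name."""
    _table = {}

    def __new__(cls, name):
        if name not in cls._table:
            obj = super().__new__(cls)
            obj.name = name
            cls._table[name] = obj
        return cls._table[name]


class ObjCClass:
    def __init__(self, name, methods):
        self.name = name
        # Per-class table mapping selectors to implementations.
        self.dispatch = {Selector(sel): imp for sel, imp in methods.items()}


def msg_send(receiver_class, selector, *args):
    """Look up the implementation for selector in the receiver's class and call it."""
    imp = receiver_class.dispatch[selector]
    return imp(*args)


circle = ObjCClass("Circle", {"describe": lambda: "a circle"})
square = ObjCClass("Square", {"describe": lambda: "a square"})
describe = Selector("describe")  # same selector value for both classes
print(msg_send(circle, describe), "/", msg_send(square, describe))
```

Because the selector is shared while each class keeps its own table, the same message reaches a different implementation depending on the receiver.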
The Objective-C runtime maintains information about the argument and return types of methods. However, this information is not part of the name of the method, and can vary from class to class.
Since Objective-C does not support namespaces, there is no need for the mangling of class names (that do appear as symbols in generated binaries).
For 2014 Swift, the mangled name for a method of a class in a module encodes the module name, the class name, the method name, and the parameter and return types.
Mangling for versions since Swift 4.0 is documented officially. It retains some similarity to Itanium.